Abstract In this paper, we report a multiple sequence alignment of 10 amino acid sequences of the M protein from different coronaviruses (4 SARS-associated and 6 previously known). The alignment model is based on the profile HMM (Hidden Markov Model), and the model was trained with the SAHMM (Self-Adapting Hidden Markov Model) software developed by the authors.
1 Introduction

SARS is the first newly identified serious infectious disease that humans have faced at the beginning of the 21st century. It has been preliminarily recognized that a variant virus from the coronavirus family might be the candidate pathogen of SARS, as reported by WHO (World Health Organization) on April 29, 2003 (http://www.who.int/csr/sarscountry/en).

Coronaviruses were first isolated from chickens in 1937. There are now approximately 15 species in this family. Coronavirus particles are irregularly shaped, about 60-220 nm in diameter, with an outer envelope bearing distinctive 'club-shaped' peplomers (about 20 nm long and 10 nm wide at the distal end) [1]. This 'crown-like' appearance (Latin, corona) gives the family its name.

The genome of the SARS-associated coronavirus (isolate BJ01) is 29,725 nt long and has 11 ORFs (Open Reading Frames). The whole genome is composed of a stable region encoding an RNA-dependent RNA polymerase (composed of 2 ORFs) and a variable region containing 4 CDSs (Coding Sequences) for the viral structural proteins (the S, E, M, and N proteins) and 5 PUPs (Putative Uncharacterized Proteins) [2]. Its gene order is identical to that of other known coronaviruses.

The S (Spike) protein and the N (Nucleocapsid) protein, perhaps together with the M protein, appear to be the most important candidates for future antibody- and vaccine-based diagnostic testing, prevention, and treatment, as well as for exploring immunoreactions [2]. Due to limited page space, we choose the M protein as an illustrative example here. The M protein, which is involved in transmembrane budding and envelope formation, was predicted to be a mid-sized protein (221 amino acid residues). It is located at nucleotide positions 26379-27044 (isolate BJ01) [2].
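As a quick consistency check on these two figures, the sketch below converts the reported coordinate range into a residue count. It assumes that the coordinates are 1-based and inclusive and that the range includes the stop codon; neither assumption is stated in the text, and the function name is purely illustrative.

```python
def orf_protein_length(start, end, has_stop_codon=True):
    """Number of amino acid residues encoded by an ORF.

    start, end: 1-based, inclusive nucleotide positions (an assumption).
    has_stop_codon: whether the range includes the terminating codon
    (also an assumption; the paper does not say).
    """
    n_nt = end - start + 1
    if n_nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    codons = n_nt // 3
    return codons - 1 if has_stop_codon else codons


# Coordinates of the M protein gene as reported for isolate BJ01.
m_protein_length = orf_protein_length(26379, 27044)
print(m_protein_length)
```

Under these assumptions the range spans 666 nucleotides, i.e. 222 codons, which agrees with the 221 residues quoted above once the stop codon is excluded.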

For the M protein, both pairwise and multiple sequence alignments have been obtained with the BLAST method and the ClustalW 1.8 software (http://www.ddbj.nig.ac.jp/E-mail/clustalw-e.htm), respectively, and reported in the literature. However, as far as the authors know, a multiple sequence alignment based on the profile HMM has not yet been reported.

In this paper, we report some results of a multiple sequence alignment of 10 amino acid sequences of the M protein, which come from different coronaviruses in the NCBI databases (http://www.ncbi.nlm.nih.gov). They cover 4 SARS-associated coronaviruses isolated from patients in Canada, the USA, and China (Beijing, Hong Kong), and 6 others: 2 from humans (229E, Transmissible gastroenteritis), 3 from domestic animals (Porcine, Bovine, Turkey), and 1 from birds (Avian).

2 Model and Method 


The alignment model is based on the profile HMM, and its topology is as follows [3]:



Fig. 1 Topology of the profile HMM 
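Since the figure itself is not reproduced here, the following sketch enumerates the transitions of the commonly used profile HMM layout: a match, insert, and delete state per main position, plus Begin and End states. This is an assumption about what Fig. 1 depicts. The SAHMM topology may differ in detail, and the state names and function name are illustrative only.

```python
def profile_hmm_transitions(n_main):
    """List the (source, target) transitions of a textbook profile HMM.

    Assumed layout: match states M1..MN, insert states I0..IN,
    delete states D1..DN, with Begin acting as M0 and End as M(N+1).
    """
    if n_main < 1:
        raise ValueError("need at least one main state")

    def match(k):
        if k == 0:
            return "Begin"
        if k == n_main + 1:
            return "End"
        return f"M{k}"

    transitions = []
    for k in range(n_main + 1):
        sources = [match(k), f"I{k}"]
        if k >= 1:
            sources.append(f"D{k}")        # there is no D0
        targets = [match(k + 1), f"I{k}"]  # advance, or stay in the insert state
        if k < n_main:
            targets.append(f"D{k + 1}")    # skip the next main position
        for s in sources:
            for t in targets:
                transitions.append((s, t))
    return transitions


edges = profile_hmm_transitions(3)
print(len(edges))
```

With this layout the number of transitions grows linearly in the number of main states N (9N + 3 in this sketch), which is why N is the natural quantity to tune when the topology is optimized.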

The model training is implemented through the SAHMM software developed by the authors. The SAHMM software includes a two-stage alternating optimization method that maximizes the Bayesian posterior probabilities of the parameters and the topology of a hidden Markov model.

Let $M_N$ denote the profile HMM with $N$ main states, $\lambda_N$ the parameter set of the profile HMM (including the state transition probabilities and the symbol emission probabilities), $O = \{O^{(w)}\}$, $w = 1, 2, \ldots, W$, the training sequence set, and $T_w$ the length of the training sequence $O^{(w)}$.

The first step of the two-stage alternating optimization method in the SAHMM software is parameter estimation, that is, finding $\lambda_N^*$ when the number of main states $N$ is fixed. By using the Bayes formula, we have

$$\lambda_N^* = \arg\max_{\lambda_N} P(\lambda_N \mid O, M_N) = \arg\max_{\lambda_N} P(O \mid \lambda_N, M_N)\, P(\lambda_N \mid M_N) \qquad (1)$$

in which $P(O \mid \lambda_N, M_N)$ is the likelihood function of the training sequence set $O$, and $P(\lambda_N \mid M_N)$ is the prior distribution of the parameter set $\lambda_N$. We use a Bayesian Baum-Welch algorithm plus simulated annealing to estimate the parameters $\lambda_N$ of the profile HMM. The Baum-Welch algorithm is a special case of the more general EM algorithm. It iterates between an expectation step (E-step) and a maximization step (M-step). The iterative process continues until a stopping rule is satisfied.
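To make the role of the prior in the M-step concrete, here is a minimal sketch of one re-estimation of a single probability distribution (for example, the emission distribution of one match state). It assumes a Dirichlet prior expressed as pseudocounts and uses the posterior-mean update that is common for profile HMMs. The text does not specify the prior used in SAHMM, so this is an illustration rather than the authors' update rule. The expected counts would come from the forward-backward E-step, which is omitted here, as is the simulated annealing component.

```python
def reestimate_distribution(expected_counts, pseudocounts):
    """Posterior-mean re-estimate of a discrete distribution.

    expected_counts: expected usage of each symbol/transition from the E-step.
    pseudocounts: Dirichlet prior parameters (hypothetical values, chosen
    by the user; not taken from the paper).
    """
    if len(expected_counts) != len(pseudocounts):
        raise ValueError("counts and pseudocounts must have the same length")
    totals = [c + a for c, a in zip(expected_counts, pseudocounts)]
    norm = sum(totals)
    if norm <= 0:
        raise ValueError("total mass must be positive")
    return [t / norm for t in totals]


# Toy three-symbol alphabet: the third symbol was never observed,
# but the prior keeps its probability above zero.
probs = reestimate_distribution([3.0, 1.0, 0.0], [1.0, 1.0, 1.0])
print(probs)
```

With only a handful of training sequences, as in this study, such a prior prevents unseen residues from receiving zero probability, which would otherwise make any sequence containing them impossible under the model.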

The second step of the two-stage alternating optimization method in the SAHMM software is topology optimization, that is, finding the following $M_N^*$:

$$M_N^* = \arg\max_{M_N} P(M_N \mid O) = \arg\max_{M_N} P(O \mid M_N)\, P(M_N) \qquad (2)$$

in which $P(M_N)$ is the prior distribution of the model topology $M_N$. Under the assumption of a non-informative prior distribution, we have

$$P(M_N \mid O) \propto P(O \mid M_N) = \int P(O \mid \lambda_N, M_N)\, P(\lambda_N \mid M_N)\, d\lambda_N \qquad (3)$$

Usually, the integral in Eq. (3) is difficult to calculate directly. Hence we use the Bayesian Information Criterion (BIC) [4] to approximate it:

$$\mathrm{BIC} = -2 \log P(O \mid \lambda_N^*, M_N) + K_N \log W \qquad (4)$$

where $K_N$ is the number of free parameters in the profile HMM with $N$ main states, $W$ is the sample size, and $-\log P(O \mid \lambda_N^*, M_N)$ is the negative of the maximized log-likelihood of the training sequence set $O$. Then the optimum topology model $M_N^*$ is

$$M_N^* = \arg\min_{M_N} \{ -2 \log P(O \mid \lambda_N^*, M_N) + K_N \log W \} \qquad (5)$$

We have proved that $P(O \mid \lambda_N^*, M_N)$ is a monotonically increasing function of $N$, so the objective function of (5) is unimodal (single-peaked). We can therefore solve (5) with various optimization methods, e.g. the golden section method.
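The search over $N$ can be sketched as follows. This is a minimal illustration of a golden section search over integer model sizes for a unimodal BIC score, not the SAHMM implementation. The toy log-likelihood and parameter-count functions are hypothetical stand-ins; in practice each evaluation would require training a profile HMM with $N$ main states.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # golden ratio conjugate, about 0.618


def bic(log_likelihood, num_free_params, sample_size):
    """Eq. (4): BIC = -2 log P(O | lambda*, M_N) + K_N log W."""
    return -2.0 * log_likelihood + num_free_params * math.log(sample_size)


def golden_section_argmin(score, lo, hi):
    """Return the integer in [lo, hi] that minimizes a unimodal score."""
    cache = {}

    def cached(n):
        # Each evaluation may be expensive (one model training run), so memoize.
        if n not in cache:
            cache[n] = score(n)
        return cache[n]

    a, b = lo, hi
    while b - a > 3:
        step = int(round(INV_PHI * (b - a)))
        c, d = b - step, a + step  # two interior probe points
        if c >= d:                 # rounding can merge them on short brackets
            c = d - 1
        if cached(c) <= cached(d):
            b = d                  # the minimum lies in [a, d]
        else:
            a = c                  # the minimum lies in [c, b]
    # Finish with an exhaustive scan of the few remaining candidates.
    return min(range(a, b + 1), key=cached)


# Hypothetical stand-ins: a log-likelihood that increases with N and a
# parameter count that grows linearly with N.
def toy_log_likelihood(n):
    return -1000.0 - 5000.0 / n


def toy_num_free_params(n):
    return 40 * n


W = 10  # number of training sequences


def toy_bic(n):
    return bic(toy_log_likelihood(n), toy_num_free_params(n), W)


best_n = golden_section_argmin(toy_bic, 1, 300)
print(best_n)
```

Because the bracket shrinks by a roughly constant factor per iteration, only a logarithmic number of candidate sizes has to be trained, rather than every $N$ in the range.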

3 Data and Results 

3.1 Data 
Organism                              Accession     Length     GenPept list_uids
SARS coronavirus BJ01                 AY278488.2    221 a.a.   30275673
SARS coronavirus CUHK-W1              AY278554.2    221 a.a.   30023958
SARS coronavirus NC_004718.3          NC_004718.3   221 a.a.   29836504
SARS coronavirus Urbani               AY278741.1    221 a.a.   30027623
Transmissible gastroenteritis virus   NC_002306.2   262 a.a.   13399294
Human coronavirus 229E                NC_002645     225 a.a.   12175752
Porcine epidemic diarrhea virus       D49591        226 a.a.   1360870
Bovine coronavirus                    AF220295.1    230 a.a.   17529680
Turkey coronavirus                    JQ1172        230 a.a.   77083
Avian infectious bronchitis virus     M95169.1      225 a.a.   292958

Web site: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=<list_uids>&dopt=GenPept



3.2 Results 

The multiple sequence alignment of the M protein produced by the ClustalW (1.8) software

BJ01-a           MADNGTITVEELKQLLEQWNLVIGFLFLAW
CUHK-b           MADNGTITVEELKQLLEQWNLVIGFLFLAW
NC-c             MADNGTITVEELKQLLEQWNLVIGFLFLAW
urbani-d         MADNGTITVEELKQLLEQWNLVIGFLFLAW
Transmissible-e  -MKILLILACVIACACGERYCAMKSDTDLSCRNSTASDCESCFNGGDLIWHLANWNFSWSIILIVF
human-f          MSNDNCTGDIVTHLKNWNFGWNVILTIF
porcine-g        MSNGSIPVDEVIEHLRNWNFTWNIILTIL
Bovine-h         MSSVTTPAPVYTWTADEAIKFLKEWNFSLGIILLFI
Turkey-i         MSSVTTPAPVYTWTADEAIKFLKEWNFSLGIILLFI
Avian-j          MPNETNCTLDFEQSVQLFKEYNLFITAFLLFL

BJ01-a           IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
CUHK-b           IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
NC-c             IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
urbani-d         IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
Transmissible-e  ITVLQYGRPQFSWFVYGIKMLIMWLLWPVVLALTIFNAYSEYQVSRYVMFGFSIAGAIVTFVLWIM
human-f          IVILQFGHYKYSRLFYGLKMLVLWLLWPLVLALSIFDTWANWDSN-WAFVAFSFFMAVSTLVMWVM
porcine-g        LVVLQYGHYKYSVFLYGVKMAILWILWPLVLALSLFDAWASFQVN-WVFFAFSILMACITLMLWIM
Bovine-h         TVILQFGYTSRSMFVYVIKMVILWLMWPLTIILTIFN--CVYALN-NVYLGFSIVFTIVAIIMWIV
Turkey-i         TIILQFGYTSRSMSVYVIKMIILWLMWPLTIILTIFN--CVYALN-NVYLGFSIVFTIVAIIMWIV
Avian-j          TIILQYGYATRSKVIYTLKMIVLWCFWPLNIAVGVIS--CTYPPN-TGGLVAAIILTVFACLSFVG

BJ01-a           YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHSLGR-
CUHK-b           YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHSLGR-
NC-c             YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHSLGR-
urbani-d         YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHPLGR-
Transmissible-e  YFVRSIQLYRRTKSWWSFNPETKAILCVSAL-GRSYVLPLEGVPTGVTLTLLSGNLYAEGFKIAGG
human-f          YFANSFRLFRRARTFWAWNPEVNAITVTTVL-GQTYYQPIQQAPTGITVTLLSGVLYVDGHRLASG
porcine-g        YFVNSIRLWRRTHSWWSFNPETDALLTTSVM-GRQVCIPVLGAPTGVTLTLLSGTLLVEGYKVATG
Bovine-h         YFVNSIRLFIRTGSWWSFNPETNNLMCIDMK-GRMYVRPIIEDYHTLTVTIIRGHLYMQGIKLGTG
Turkey-i         YFVNSIRLFIRTGSWWSFNPETNNLMCIDMK-GRMYVRPIIEDYHTLTVTIIRGHLYMQGIKLGTG
Avian-j          YWIQSIRLFKRCRSWWSFNPESNAVGSILLTNGQQCNFAIESVPMVLSPIIKNGVLYCEGQWLAK-

BJ01-a           CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
CUHK-b           CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
NC-c             CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
urbani-d         CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
Transmissible-e  MNIDNLPKYVMVALPSRTIVYTLVGKKLKASSATGWAYYVKSKAGDYSTEAR-TDNLSEQEKLLHMV
human-f          VQVHNLPEYMTVAVPSTTIIYSRVGRSVNSQNSTGWVFYVRVKHGDFSAVSSPMSNMTENERLLHFF
porcine-g        VQVSQLPNFVTVAKATTTIVYGRVGRSVNASSGTGWAFYVRSKHGDYSAVSNPSAVLTDSEKVLHLV
Bovine-h         YSLSDLPAYVTVAKVS-HLLTYKRGFLDKIGDTSGFAVYVKSKVGNYRLPSTQKGSGLDTALLRNNI
Turkey-i         YSLSDLPAYVTVAKVS-HLLTYKRGFLDKIGDTSGFAVYVKSKVGNYRLPSTQKGSGMDTALLRNNI
Avian-j          CEPDHLPKDIFVCTPDRRNIYRMVQKYTGDQSGNKKRFATFVYAKQSVDTGELESVATGGSSLYT--


The multiple sequence alignment of the M protein produced by the SAHMM software

BJ01-a           MADNGTI- -T VE-E-LKQLL EQWN- -LVI-GFLFLAWI-
CUHK-b           MADNGTI- -T VE-E-LKQLL EQWN- -LVI-GFLFLAWI-
NC-c             MADNGTI- -T VE-E-LKQLL EQWN- -LVI-GFLFLAWI-
urbani-d         MADNGTI- -T VE-E-LKQLL EQWN- -LVI-GFLFLAWI-
Transmissible-e  MKILLILACV IACACGERYC AM-K-SDTDL SCRNSTASDC ESCFNG-GDL IWHLANWNFS
human-f          M-SNDNC- -T GD-I--VTHL KNWNF- - GWN VILTIFIV
porcine-g        M-SNGSI- -P VD-E-VIEHL RNWNF- --TWN-IILTILLV-
Bovine-h         MSSVTTPAPV YTW- -T AD-E-AIKFL KEWNFS- -LGI--ILLFITV-
Turkey-i         MSSVTTPAPV YTW- -T AD-E-AIKFL KEWNFS- -LGI--ILLFITI-
Avian-j          MPNETNC- -T LDFEQSVQLF KEYN- --LFITAFLLFLTI-

BJ01-a           ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
CUHK-b           ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
NC-c             ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
urbani-d         ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
Transmissible-e  WSIILIVFIT VL-QYGRPQF SWFVYGIKML IMWLLWPVVL ALTIFNAYSE YQVSRYVMFG
human-f          ---------- IL-QFGHYKY SRLFYGLKML VLWLLWPLVL ALSIFDTWAN WDS-NWAFVA
porcine-g        ---------- VL-QYGHYKY SVFLYGVKMA ILWILWPLVL ALSLFDAWAS FQV-NWVFFA
Bovine-h         ---------- IL-QFGYTSR SMFVYVIKMV ILWLMWPLTI ILTIFNC-VY ALN-NVYLGF
Turkey-i         ---------- IL-QFGYTSR SMSVYVIKMI ILWLMWPLTI ILTIFNC-VY ALN-NVYLGF
Avian-j          ---------- IL-QYGYATR SKVIYTLKMI VLWCFWPLNI A-VGVIS-CT YPP-N--TGG

BJ01-a           -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
CUHK-b           -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
NC-c             -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
urbani-d         -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
Transmissible-e  -FSIAGAIVT F--VLWIMYF VRSIQLYRRT KSWWSFNPET KAILCVSALG RSYV-LPLEG
human-f          -FSFFMAVST L--VMWVMYF ANSFRLFRRA RTFWAWNPEV NAITVTTVLG QTYY-QPIQQ
porcine-g        -FSILMACIT L--MLWIMYF VNSIRLWRRT HSWWSFNPET DALLTTSV-M GRQVCIPVLG
Bovine-h         SIVFTIVAII ----MWIVYF VNSIRLFIRT GSWWSFNPET NNLMCIDMKG RMYV-RPIIE
Turkey-i         SIVFTIVAII ----MWIVYF VNSIRLFIRT GSWWSFNPET NNLMCIDMKG RMYV-RPIIE
Avian-j          -LVAAIILTV FACLSFVGYW IQSIRLFKRC RSWWSFNPES NAVGSILLTN GQQC-NFAIE

BJ01-a           S-ELVIGAVI IRGHLRMAGH -SLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
CUHK-b           S-ELVIGAVI IRGHLRMAGH -SLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
NC-c             S-ELVIGAVI IRGHLRMAGH -SLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
urbani-d         S-ELVIGAVI IRGHLRMAGH -PLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
Transmissible-e  V-PTGVTLTL LSGNLYAEGF KIAGGMNIDN LPKYVMVALP SRTIVYTLVG K--KLKASSA
human-f          A-PTGITVTL LSGVLYVDGH RLASGVQVHN LPEYMTVAVP STTIIYSRVG R--SVNSQNS
porcine-g        A-PTGVTLTL LSGTLLVEGY KVATGVQVSQ LPNFVTVAKA TTTIVYGRVG R--SVNASSG
Bovine-h         D-YHTLTVTI IRGHLYMQGI KLGTGYSLSD LPAYVTVAKV SHLLTY-KRG F--LDKIGDT
Turkey-i         D-YHTLTVTI IRGHLYMQGI KLGTGYSLSD LPAYVTVAKV SHLLTY-KRG F--LDKIGDT
Avian-j          SVPMVLSPII KNGVLYCEGQ -WLAKCEPDH LPKDIFVCTP DRRNIYRMVQ KYTGDQSGNK

BJ01-a           SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
CUHK-b           SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
NC-c             SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
urbani-d         SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
Transmissible-e  TGWAYY-VKS K--AGDYSTE AR-TDNLSEQ EKLLH-MV
human-f          TGWVFY-VRV K--HGDFSAV SSPMSNMTEN ERLLH-FF
porcine-g        TGWAFY-VRS K--HGDYSAV SNPSAVLTDS EKVLH-LV
Bovine-h         SGFAVY-VKS K--VGNYRLP ST-QKGSGLD TALLRNNI
Turkey-i         SGFAVY-VKS K--VGNYRLP ST-QKGSGMD TALLRNNI
Avian-j          KRFATF-VYA KQSVDTGELE SV-ATGGS-S L-----YT